Abstract


OpenMP is a proposed industry standard Application Programmer Interface
(API) that supports shared-memory parallel programming in Fortran and C/C++
on architectures including Unix, Linux, and Windows NT platforms. This report
discusses experiences using OpenMP implementations on Shared Resource
Center (SRC) platforms. The experiences include running OpenMP benchmarks,
as well as using OpenMP with applications. Tools available for debugging and
analyzing OpenMP programs are also covered. Most of the results in this report
should be considered preliminary and the basis for further investigation.

1. Introduction


OpenMP is a proposed industry standard Application Programmer Interface
(API) that supports shared-memory parallel programming in Fortran and C/C++
on architectures including Unix, Linux, and Windows NT platforms. Jointly
defined by a group of major computer hardware and software vendors who
make up the OpenMP Architecture Review Board (ARB), OpenMP is intended
to give shared-memory parallel programmers a portable, scalable programming
model and a simple interface for developing parallel applications for platforms
ranging from the desktop to the supercomputer. (See reference [1] for more
information about OpenMP.) The OpenMP compilers used here include the
following:

• SGI MIPSpro 7.3.1.1 Fortran 77 and Fortran 90 compilers on an SGI Origin
2000 and an SGI Origin 3000 running IRIX 6.5;

• IBM XL 7.1 Fortran 77/90/95 compilers on an IBM Power3 SMP with eight
processors per node running AIX 4.3;

• Sun Forte 6 update 1 Fortran 95 on a Sun HPC 10000 running Solaris 8; and

• KAI Guide 3.9 Fortran 77 and Fortran 90 compilers on SGI, IBM, and Sun
platforms.


2.  Benchmark  Results 


The EPCC OpenMP microbenchmarks are intended to measure the overheads of
synchronization and loop scheduling in the OpenMP run-time library [2]. The
overhead measurements can be used to compare the efficiency of the run-time
libraries of different OpenMP implementations and give guidance on the
performance implications of choosing between semantically equivalent directives
(e.g., CRITICAL vs. ATOMIC vs. lock routines). Many of these benchmarks
address the barrier implementations in OpenMP. However, the overhead itself
may not be an indication of how well an individual OpenMP program will
perform. An application program will use a whole ensemble of directives, and its
performance cannot be predicted on the basis of certain directives alone.
Nevertheless, these benchmarks are meant to give the application programmer
some guidance on choosing directives and to indicate to vendors where
improvement in their OpenMP implementations may be needed. A detailed
explanation of the measurement methodology can be found in reference [2]; a
brief explanation follows. The overhead of a parallel program is defined as
Tp - Ts/p, where Tp is the parallel execution time, Ts is the serial execution
time, and p is the processor count. The overheads of a number of directives are
measured in this simple fashion. Overheads are reported in processor clock
cycles to allow comparison between different systems.
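
As a minimal sketch of the overhead definition above (not the EPCC benchmark code itself), the following Python functions compute Tp - Ts/p and convert the result to processor clock cycles. The function names, the sample timings, and the 400 MHz clock rate are illustrative assumptions.

```python
def overhead_seconds(t_parallel, t_serial, processors):
    """Overhead as defined in the text: Tp - Ts/p, in seconds."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return t_parallel - t_serial / processors


def overhead_cycles(t_parallel, t_serial, processors, clock_hz):
    """Express the same overhead in processor clock cycles."""
    return overhead_seconds(t_parallel, t_serial, processors) * clock_hz


# Hypothetical timings: 8.0 s serial, 1.5 s in parallel on 8 processors,
# on an assumed 400 MHz processor.
seconds = overhead_seconds(1.5, 8.0, 8)
cycles = overhead_cycles(1.5, 8.0, 8, 400e6)
print(seconds, cycles)
```

Reporting cycles rather than seconds removes the clock rate from the comparison, which is why the same measurement can be compared across systems.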

The loop scheduling benchmark measures overheads for STATIC, DYNAMIC,
and GUIDED schedules with different chunk sizes. Results for the Sun E10000,
SGI Origin 3000, and IBM Power3 SMP for both vendor and KAI Guide
compilers (where possible) are shown in Figures 1 through 5. From these
figures, it can be seen that dynamic scheduling is expensive, especially for small
chunk sizes. Since the default chunk size is 1 for most OpenMP
implementations, users need to be careful to set the chunk size to a larger value
when using dynamic scheduling. On the Origin, the overheads of dynamic
scheduling are so large as to render it useless, at least with the default setup.
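
To illustrate why small chunk sizes are costly under dynamic scheduling, the sketch below counts how many chunks the run-time library must hand out for a loop of a given length. It is a simplified model that assumes each chunk costs one trip to a shared scheduler. The function name and loop size are hypothetical, and the model says nothing about the actual per-dispatch cost on any platform.

```python
def dynamic_dispatch_count(iterations, chunk_size=1):
    """Number of chunks handed out under DYNAMIC scheduling.

    Each chunk is requested from the scheduler at run time, so in this
    simplified model the count is a rough proxy for scheduling overhead.
    """
    if iterations < 0 or chunk_size < 1:
        raise ValueError("iterations must be >= 0 and chunk_size >= 1")
    # Integer ceiling division: the last chunk may be partially filled.
    return (iterations + chunk_size - 1) // chunk_size


# A 1024-iteration loop (size chosen only for illustration).
for chunk in (1, 8, 64):
    print(chunk, dynamic_dispatch_count(1024, chunk))
```

With the default chunk size of 1, every iteration requires its own dispatch. A larger chunk reduces the dispatch count proportionally, at the price of coarser load balancing.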

The synchronization benchmark measures synchronization overheads for several
directives that involve a barrier: parallel, for, parallel for, barrier, and single.
The overheads of each of the operations are measured for different numbers of
threads. Results for the Sun E10000, SGI Origin 3000, and IBM Power3 SMP for
both vendor and KAI Guide compilers (where possible) are shown in Figures 6
through 10.

The PBN, or "Programming Baseline for NPB," consists of three sets of source
codes based on the NASA Advanced Supercomputing (NAS) Division Parallel
Benchmarks version 2.3. The PBN contains an improved sequential
implementation, a sample OpenMP implementation, and a sample HPF
implementation. The directives inserted for the OpenMP implementation reflect
a programmer's parallelization and data distribution strategy, while the compiler
is responsible for implementation and optimization. These benchmarks
complement the EPCC benchmarks by providing application-oriented
performance measures. Each application in the benchmark has three problem
sizes, which are simply called A, B, and C, where C is the largest problem. We
looked at the Mflop rate for each problem size as a function of the number of
processors. The OpenMP version of the PBN benchmarks has been rewritten by
the Real World Computing Program (RWCP)/Omni group in Japan to eliminate
some problems. We have not been able to run all size C problems on all of the
platforms due to memory limitations and occasional segmentation faults. We
will follow up on these problems. Some preliminary results for the Origin 3000
are shown in Figures 11 through 14. These are with STATIC scheduling and the
default chunk size. The OpenMP versions of most of the benchmarks appear to
scale well for larger problem sizes, although the results shown here are
somewhat noisy. We plan to rerun these benchmarks to try to achieve more
reliable results and to compare scaling with the Message Passing Interface (MPI)
versions.

3. Lessons Learned


• OpenMP private variables are allocated on a thread's stack. The default
stack size may not be large enough for parallel regions with large numbers
of private variables or regions that call subroutines with large numbers of
local variables, which are automatically private. Segmentation faults are a
frequent consequence of using a stack size that is too small. Both
environment variables and run-time routines may be used to modify the
default stack size, although the manner in which this is done is
implementation dependent. On the Origin 3000 with the MIPSpro
compiler, setting the MP_STACK_OVERFLOW environment variable
causes the OpenMP run-time system to automatically detect and report
stack overflow errors at run-time. The MP_SLAVE_STACKSIZE variable
or the MP_SET_SLAVE_STACKSIZE library routine can be used to request
larger stack sizes. Similar facilities are available with some other OpenMP
compilers (e.g., KMP_STACKSIZE with Guide).

• Hewlett-Packard currently does not support OpenMP for C. Its support of
OpenMP for Fortran is incomplete and imperfect. In particular, its
error messages are extremely poor and generally state only that an internal
error has occurred. This behavior suggests that the compiler is broken
when, in fact, the cause could be a bug in the application code.

Two  examples  of  these  problems  are  the  following: 

(1)  HP  Fortran  77  and  Fortran  90  support  a  limited  number  of  continuation 
lines.  Since  the  limitation  applies  to  compiler  directives  as  well,  a 
problem  can  arise  if  there  are  a  large  number  of  private  variables. 
Unfortunately,  rather  than  stating  what  the  problem  is,  the  compiler 
just  gives  the  internal  error  message. 

(2)  The  compiler  did  not  seem  to  work  well  with  code  generated  by  a  KAI 
tool  that  converted  SGI  directives  to  OpenMP.  When  it  was  specified 
that  variables  should  default  to  SHARED,  matters  improved.  Most  of 
the  error  messages  disappeared,  and  the  job  seemed  to  run  correctly. 

• On IBM systems, an OpenMP job that tries to use all of the processors
competes with the operating system for the attention of a processor.
OpenMP jobs tend to be relatively fine-grained; thus, if the operating
system needs 5% of a processor's attention, then the other processors will
spend 5% of their time spinning while waiting for the last thread to catch
up. The problem gets worse as the number of processors in the system
increases because the number of processors sitting at a spin lock
increases, while the amount of time required by the operating system can
also increase. On the IBM SP, the problems are even worse, for the
following reasons:

(1) When performing mixed-mode programming with MPI going between
nodes, servicing MPI requests from other processors will also require
the attention of a processor, slowing things down even more.

(2) Asynchronous transfer of data between nodes can also put a strain on
the memory system, which, in the case of some configurations, is
already stretched fairly thin.

• KAI's implementation of OpenMP is based on Pthreads. As such, it might
be expected to add extra overhead relative to a native implementation.
However, our benchmark results so far do not show this to be a problem,
and sometimes the KAI compiler outperforms the vendor compiler.
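
The first lesson above, on thread stack sizes, can be made concrete with a rough estimate of how much stack one thread needs for its private arrays. This is a back-of-the-envelope sketch only: the function names, the array shapes, the 8-byte element size, and the 16 MB stack limit are all hypothetical, and real requirements also include call frames and compiler temporaries.

```python
def private_bytes(array_shapes, bytes_per_element=8):
    """Total bytes needed for one thread's private arrays.

    array_shapes is a list of dimension tuples, e.g. [(100, 100), (500,)].
    """
    total = 0
    for shape in array_shapes:
        elements = 1
        for extent in shape:
            elements *= extent
        total += elements * bytes_per_element
    return total


def fits_on_stack(array_shapes, stack_bytes, bytes_per_element=8):
    """True if the private arrays alone fit within the given stack size."""
    return private_bytes(array_shapes, bytes_per_element) <= stack_bytes


# Hypothetical case: three private 128 x 128 x 128 double-precision arrays
# checked against an assumed 16 MB per-thread stack.
shapes = [(128, 128, 128)] * 3
needed = private_bytes(shapes)
print(needed, fits_on_stack(shapes, 16 * 1024 * 1024))
```

When such an estimate exceeds the default stack size, a larger stack can be requested through the implementation-specific mechanism, such as the MP_SLAVE_STACKSIZE or KMP_STACKSIZE variables mentioned above.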


4.  Tools  for  OpenMP 


The  TotalView  debugger  from  Etnus  provides  facilities  for  debugging  OpenMP 
programs  as  well  as  for  mixed  MPI  and  OpenMP  programs  [3].  TotalView  is 
available  for  a  large  number  of  platforms  and  is  installed  on  some 
Shared  Resource  Center  (SRC)  machines.  The  previous  version  (4.1)  had  some 
problems  debugging  threaded  programs  (such  as  OpenMP)  on  some  platforms 
(such  as  SGI),  but  this  problem  appears  to  have  been  fixed  in  version  5. 
TotalView  works  with  both  vendor  and  KAI  OpenMP  compilers. 

The KAI KAP/Pro toolset includes the Guide compiler, the Assure debugger, and
the GuideView performance analysis tool for OpenMP, which are described as
follows:

•  Guide  is  a  cross-platform  implementation  of  OpenMP  for  C,  C++,  and 
Fortran. 

• The Assure component of the KAP/Pro toolset validates the correctness of
parallel OpenMP programs and identifies programming errors that
occurred when parallelizing a sequential application. The inputs to Assure
are an OpenMP parallel program that is assumed to run correctly in
sequential mode and a data set for that program. When the
Assure-processed program is run, Assure simulates parallel execution and
identifies errors where the parallel program is inconsistent with the
corresponding sequential program. Assure can display its results using the
AssureView graphical user interface or a command-line interface.

• The GuideView component provides an instrumented run-time library that
captures timing information for detecting and diagnosing performance
problems in OpenMP parallel programs. The graphical interface provides
browsing through performance data to identify parallel regions or loops
that require attention.

Performance Application Programming Interface (PAPI) is a specification and
reference implementation of a cross-platform library interface to hardware
counters [4, 5]. These counters exist as a small set of registers that count
"events," which are occurrences of specific signals and states related to the
processor's function. Monitoring these events facilitates correlation between the
structure of source/object code and the efficiency of the mapping of that code to
the underlying architecture. This correlation has a variety of uses in performance
analysis and tuning. PAPI virtualizes the counters on a per-process and
per-thread basis and can be used for analysis of threaded programs including
OpenMP. PAPI is being installed on some SRC machines.

Vampir is a performance analysis tool for MPI parallel programs developed by
Pallas in Germany. Vampir is available on some SRC machines. The next
version of Vampir will support OpenMP in addition to MPI. Pallas and
Intel/KAI are developing a new performance analysis toolset for combined MPI
and OpenMP programming which uses PAPI to access the hardware
performance counters. PAPI's standard performance metrics, which include
metrics for shared memory processors (SMPs), will provide accurate and
relevant performance data for the clustered SMP environments targeted by the
new tool set.


5.  Conclusions  and  Future  Work 


OpenMP implementations have matured and will continue to do so.
Implementations  of  OpenMP  2.0  for  Fortran  will  hopefully  begin  to  appear  soon. 
OpenMP  is  becoming  a  viable  option  for  scalable  parallel  programming  on 
shared-memory  platforms.  We  plan  to  continue  our  benchmarking  work  and 
will  investigate  possible  solutions  to  performance  problems  encountered  on 
various  platforms. 

For example, when using the C$doacross directives on SGI, sometimes the
optimal solution is to specify INTERLEAVE, which is equivalent to STATIC
scheduling with a CHUNK SIZE of 1. Alternatively, sometimes the optimal
solution will be to specify STATIC and let the CHUNK SIZE default. In this case,
the default is not 1; rather, it is the largest CHUNK SIZE that will result in a
uniform distribution of work among the processors (within the limitations of
integer division). We plan to investigate use of this optimization with the EPCC
scheduling benchmark.
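
The two static strategies described above can be compared with a small model of how iterations are assigned to threads. The sketch assumes that the default static chunk is the ceiling of the iteration count divided by the thread count. This matches the description above but may differ in detail between compilers. The function names and loop sizes are hypothetical.

```python
def default_static_chunk(iterations, threads):
    """Assumed default chunk: ceil(iterations / threads)."""
    return (iterations + threads - 1) // threads


def static_assignment(iterations, threads, chunk):
    """List the iterations owned by each thread under STATIC scheduling.

    Chunks of the given size are dealt to the threads in round-robin order.
    """
    owned = [[] for _ in range(threads)]
    for start in range(0, iterations, chunk):
        thread = (start // chunk) % threads
        owned[thread].extend(range(start, min(start + chunk, iterations)))
    return owned


# 10 iterations on 4 threads (sizes chosen only for illustration).
interleaved = static_assignment(10, 4, 1)   # INTERLEAVE: chunk size of 1
blocked = static_assignment(10, 4, default_static_chunk(10, 4))
print(interleaved)
print(blocked)
```

In this model, the interleaved schedule gives each thread every fourth iteration, while the default gives each thread one contiguous block. Which is preferable depends on how the work per iteration varies and on memory access patterns.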


SGI Origin 3000, 400 MHz, MIPSpro f90, 8 threads
Figure 1. Scheduling overheads on an SGI Origin 3000 with the vendor compiler.

Figure 2. Scheduling overheads on an SGI Origin 3000 with the Guide compiler.

Figure 3. Scheduling overheads on a Sun E10000 with the vendor compiler.

Figure 4. Scheduling overheads on a Sun E10000 with the Guide compiler.

Figure 5. Scheduling overheads on an IBM Power3 SMP with the vendor compiler.

Figure 6. Synchronization overheads on an SGI Origin 3000 with the vendor compiler.

Figure 7. Synchronization overheads on an SGI Origin 3000 with the Guide compiler.

Figure 8. Synchronization overheads on a Sun E10000 with the vendor compiler.

Sun E10000, 400 MHz, Guide 3.9 (guide f90)

Figure 10. Synchronization overheads on an IBM Power3 SMP with the vendor compiler.

SGI Origin 3000, 400 MHz, MIPSpro f77
Figure 11. PBN BT benchmark on an SGI Origin 3000 with the vendor compiler.

SGI Origin 3000, 400 MHz, MIPSpro f77 (curves: CG A, CG B, CG C)
Figure 12. PBN CG benchmark on an SGI Origin 3000 with the vendor compiler.

Figure 13. PBN LU benchmark on an SGI Origin 3000 with the vendor compiler.

Figure 14. PBN SP benchmark on an SGI Origin 3000 with the vendor compiler.